GDAS017-使用Bioconductor进行高性能计算

性能增强概览

软件系统和数据分析的可扩展性(scalability)的概念是复杂的，精确的度量(precise metrics)和标准现在还不好处理。可以参考一下来自计算机界的 Weinstock and Goodenough 的一个概述。基于Bioconductor的分析工作流的可扩展性(scalability)可以沿着不同的路径进行，Bioconductor的两个核Mike Lawrence和Martin Morgan在工作就提到了许多有意的想法。

本文档仅解决Bioconductor直接支持的并行计算，主要通过两种方法实现：

共享内存并行性。使用内存映像的完整副本对R进程进行任意次数的分叉，并且对每个映像独立地进行计算。选定的结果将返回到主进程。可以有效地在多环境中实现。
分布式并行性。可以运行不同操作系统的独立处理器可以运行R的兼容实例。R或集群调度程序可以执行作业控制(job control)。

简要说明

现在我们来说明一下共享内存方法：

system.time( lapply(1:8, function(x)Sys.sleep(1) ) )
##    user  system elapsed 
##   0.004   0.001   8.020
library(parallel)
detectCores()
## [1] 4
options(mc.cores=4)
system.time( mclapply(1:8, function(x)Sys.sleep(1) ) )
##    user  system elapsed 
##   0.007   0.022   2.031

上面只是一个无意义的计算，我们就是为了说明并行计算的原理，通过并行计算，我们加快了计算速度，计算过程的时间缩短为原来的四分之一。

RNA-seq实验中的BAM文件

我们将使用 Zarnack et al. 2013 等的人研究中的一些BAM文件。2013年的这项假设是，有一类蛋白质，即异质核糖蛋白C1和C2，也就是指的hnRNP C(heterogeneous nuclear ribonucleoproteins C1 and C2)，这个蛋白会阻止Alu因子整合到成熟到转录本上。这样的整合会导致蛋白功能障碍，进而导致疾病；如果想了解更深入的信息，可以去查看文献原文。

我们下面所使用的包含有8个BAM文件，它们已经与chr14染色体进行了比对。其中4个文件是正常的HeLa细胞的读长文件(reads)（阴性对照），另外4个文件是敲低了hnRNP C后的HeLa的读长文件，如下所示：

library(RNAseqData.HNRNPC.bam.chr14)
dir(system.file("extdata", package="RNAseqData.HNRNPC.bam.chr14"))
##  [1] "ERR127302_chr14.bam"     "ERR127302_chr14.bam.bai"
##  [3] "ERR127303_chr14.bam"     "ERR127303_chr14.bam.bai"
##  [5] "ERR127304_chr14.bam"     "ERR127304_chr14.bam.bai"
##  [7] "ERR127305_chr14.bam"     "ERR127305_chr14.bam.bai"
##  [9] "ERR127306_chr14.bam"     "ERR127306_chr14.bam.bai"
## [11] "ERR127307_chr14.bam"     "ERR127307_chr14.bam.bai"
## [13] "ERR127308_chr14.bam"     "ERR127308_chr14.bam.bai"
## [15] "ERR127309_chr14.bam"     "ERR127309_chr14.bam.bai"
## [17] "tophat2_out"

这些都是由Tabix建立索引的BAM文件。其中*.bam 是原始的BAM文件，而*.bam.bai则是添加了索引后的BAM文件，从上面我们可以看到有8个bam.bai文件，前面四个是敲低组，后面四个是对照组。

使用BiocParallel实现隐式并行

为了促进可靠的执行程序的和谐发展，Bioconductor开发了BiocParallel包。在下面的案例中，我们将读取HNRNP C案例数据集中的reads读取到由外显子地址定义的bins中（这里我不清楚bin是什么，列出了原文）：

Consider the following example: we will count reads into bins defined by exon addresses in the HNRNPC example dataset.

RNAseqData.HNRNPC.bam.chr14 这个包含有BAM文件的绝对路径的一个向量，我将其分配给fns，如下所示：

library(RNAseqData.HNRNPC.bam.chr14)
fns = RNAseqData.HNRNPC.bam.chr14_BAMFILES
length(fns)
## [1] 8

Here we establish the exon bins into which we will count reads.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevels(txdb) = "chr14"
ebg = exonsBy(txdb, by="gene")

Now we use the summarizeOverlaps function from the GenomicAlignments package to count reads into the exon bins. We’ll time the counting from a single file, and then time the counting from all files at once.

library(GenomicAlignments)
# summarizeOverlaps uses bplapply to iterate over files
s1 = system.time(i1 <- summarizeOverlaps( ebg, fns[3] ))
s1
##    user  system elapsed 
##   2.975   0.178   3.178
# show implicit config
BiocParallel::bpparam()
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 2592000; bpprogressbar: FALSE
##   bpRNGseed: 
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
s2 = system.time(i2 <- summarizeOverlaps( ebg, fns ))
s2
##    user  system elapsed 
##   0.271   0.162  10.480

This is not a thorough way of measuring speedup but it shows reasonable enhancement. In the second computation, we did approximately 8 times as much computation, but the clock time elapsed increased only by a factor of (3.3). The default behavior of BiocParallel on the MacBook Air used to produce this document is to pick up the value of the option mc.cores and use this as the number of workers in a MulticoreParam configuration; if options()$mc.cores is NULL, 2 workers are specified.

When BiocParallel is attached, the summarizeOverlaps function will iterate over the files using bplapply from the BiocParallel package. That function will check the R session for specific parallelization configuration information. If it finds none, it will check for multiple cores and make arrangements to use them if present. The “check” occurs via the function bpparam.

The default situation on a MacBook Air running MacOSX 10.9.5 with an Intel core i7 processor, (two physical cores with two logical cores each, allowing for four concurrent threads) as follows.

options(mc.cores=NULL)
library(BiocParallel)
register(MulticoreParam())
bpparam()
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 2592000; bpprogressbar: FALSE
##   bpRNGseed: 
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

This identifies an object called MulticoreParam which is used to configure the behavior of bplapply and other utilities of the BiocParallel package. There are various configuration object classes that can be used.

‘SnowParam’: distributed memory computing
‘MulticoreParam’: shared memory computing
‘BatchJobsParam’: scheduled cluster computing
‘DoparParam’: foreach computing
‘SerialParam’: non-parallel execution

We need to use register to determine the type of concurrent computation that will be performed.

If process size is large, we may want to leave some cores idle. We can accomplish that by using register. Here we’ll check for any advantage of using all logical cores.

library(BiocParallel)
register(MulticoreParam(workers=1))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##  15.653   1.286  17.070
register(MulticoreParam(workers=2))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.103   0.029   8.818
register(MulticoreParam(workers=3))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.051   0.023   9.543
register(MulticoreParam(workers=4))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.056   0.026   8.774
all.equal(i3,i2)  # check that the results do not change
## [1] TRUE

Note that in this environment, despite increasing the number of CPUs linearly do not appreciably reduce run time after moving to two cores. This due mainly to communication costs that typically increase with the number of CPUs, and this phenomenon will be environment-specific.

In summary, it is very easy to perform embarrassingly parallel tasks with R, and this carries over to genomic data analysis thanks to BiocParallel. There are some strategic considerations concerning control of memory consumption and communication costs, and full mastery of the topic involves attention to profiling and benchmarking methods that can be addressed in an advanced software development course.

参考资料

Architecture: considerations on high performance computing with Bioconductor